How to run Whisper as ONNX?

I converted the Hugging Face Whisper model to ONNX with optimum-cli:

optimum-cli export onnx --model openai/whisper-small.en  whispersmallen

I got 4 onnx files:

decoder_model_merged.onnx
decoder_model.onnx
decoder_with_past_model.onnx
encoder_model.onnx

Now I want to write code that loads Whisper (as ONNX) and runs it on a 1.wav file.

  • How do I do it?
  • When using the HF Whisper model, I just run one model (not two separate models: encoder/decoder)

1. Install Required Libraries

pip install onnxruntime librosa transformers numpy

2. Preprocess Audio into a Log-Mel Spectrogram

import numpy as np
import librosa
from transformers import WhisperFeatureExtractor

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("1.wav", sr=16000)
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small.en")

# Convert to a log-mel spectrogram (padded/truncated to 30 s of audio)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
input_features = inputs["input_features"]  # shape: (1, 80, 3000)

3. Load the ONNX Encoder and Run It

import onnxruntime as ort

# Load encoder
encoder_sess = ort.InferenceSession("whispersmallen/encoder_model.onnx")

# Run encoder
encoder_outputs = encoder_sess.run(
    output_names=["last_hidden_state"],
    input_feed={"input_features": input_features}
)[0]
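
A quick shape check helps catch preprocessing mistakes early. For whisper-small the encoder output should be (1, 1500, 768): the 3000 mel frames are downsampled by a factor of 2 and the hidden size is 768 (these numbers come from the small model's configuration, not from the post above):

# Encoder output: (batch, n_frames // 2, hidden_size); for whisper-small: (1, 1500, 768)
print("encoder output shape:", encoder_outputs.shape)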

4. Autoregressive Decoding Loop

Whisper's decoder takes the tokens generated so far (decoder_input_ids) together with the encoder output (encoder_hidden_states) and predicts the next token, one step at a time. This walkthrough uses decoder_model.onnx, which recomputes attention over the whole token sequence at every step; decoder_with_past_model.onnx and decoder_model_merged.onnx accept cached key/values for faster generation, at the cost of extra bookkeeping.

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small.en")

decoder_sess = ort.InferenceSession("whispersmallen/decoder_model.onnx")

# Start with <|startoftranscript|>
decoder_input_ids = np.array([[tokenizer.convert_tokens_to_ids("<|startoftranscript|>")]], dtype=np.int64)

generated_ids = []

for _ in range(100):  # max 100 tokens
    outputs = decoder_sess.run(
        output_names=["logits"],
        input_feed={
            "input_ids": decoder_input_ids,
            "encoder_hidden_states": encoder_outputs
        }
    )
    
    next_token_logits = outputs[0][:, -1, :]  # shape: (1, vocab_size)
    next_token_id = np.argmax(next_token_logits, axis=-1)[0]
    
    if next_token_id == tokenizer.eos_token_id:
        break

    generated_ids.append(next_token_id)
    decoder_input_ids = np.append(decoder_input_ids, [[next_token_id]], axis=-1)

5. Decode the Output

transcription = tokenizer.decode(generated_ids, skip_special_tokens=True)
print("Transcription:", transcription)

Summary

  • With the raw ONNX export, you handle the encoder and decoder explicitly.
  • The exported ONNX files do not wrap both parts into a single model (see the sketch after this list for an optimum wrapper that does).
  • The decoder loop is autoregressive: each generated token is fed back as input at the next step.
  • Pre- and post-processing can still use the Hugging Face feature extractor and tokenizer.
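
If you would rather keep the one-model workflow you had with the plain HF model, the optimum library can load the same export folder behind a single generate() call, so the decoding loop above is handled for you. A minimal sketch, assuming optimum[onnxruntime] is installed and whispersmallen is the folder created by optimum-cli:

import librosa
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

# Wraps the exported encoder/decoder ONNX files and runs the autoregressive loop internally
model = ORTModelForSpeechSeq2Seq.from_pretrained("whispersmallen")
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")

audio, _ = librosa.load("1.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

generated_ids = model.generate(input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

generate() uses the model's default decoding settings; the hand-rolled loop above is still useful when you want to see exactly what the decoder consumes.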

ChatGPT provided this, and I tested it — it works.
